You should explore every data set numerically and visually prior to modeling it. The data exploration process will help you:
When we receive a data set for the first time, we often need to:
The process of preparing the data into a friendly format is known as “cleaning”.
We will systematically explore the penguins_raw data set
from the palmerpenguins package. To use the data:
data function,name of the data set to be loadeddata(penguins, package = "palmerpenguins")
This command actually loads two data sets:
penguins_raw: the data set we will be looking at.penguins: a simplified version.The penguins_raw data set provides data related to
various penguin species measured in the Palmer Archipelago (Antarctica),
originally provided by @GormanEtAl2014.
The data set includes 344 observations of 17 variables. The variables are:
studyName: a character variable indicating
the expedition from which the data were collected.Sample Number: a numeric variable denoting
the continuous number sequence for each sample.Species: a character variable indicating
the penguin species.Region: a character variable denoting the
region of the Palmer LTER sampling grid the sample was obtained.Island: a character variable indicating
the island on which the penguin was observed.Stage: a character variable indicating the
reproductive stage of the observation.Individual ID: a character variable
indicating the unique identification number for each individual in the
data set.Clutch Completion: a character variable
indicating whether the study nest was observed with a “full clutch” of 2
eggs.Date Egg: a Date variable indicating the
date that the study nest was observed with 1 egg.Culman Length (mm): a numeric variable
indicating the length of the dorsal ridge of the penguin’s bill in
millimeters.Culmen Depth (mm): a numeric variable
indicating the indicating the depth of the dorsal ridge of the penguin’s
bill in millimeters.Flipper Length (mm): a numeric variable
indicating the penguin’s flipper length in millimeters.Body Mass (g): a numeric variable
indicating the penguin’s body mass in grams.Sex: a character variable indicating the
penguin’s sex (FEMALE, MALE)Delta 15 N (o/oo): a numeric variable
indicating the ratio of stable isotopes 15N:14N.Delta 13 C (o/oo): a numeric variable
indicating the ratio of stable isotopes 15C:12C.Comments: a character variable providing
additional information about the observation.The str function provides a general overview of the
data’s structure.
str(penguins_raw, give.attr = FALSE)
## tibble [344 × 17] (S3: tbl_df/tbl/data.frame)
## $ studyName : chr [1:344] "PAL0708" "PAL0708" "PAL0708" "PAL0708" ...
## $ Sample Number : num [1:344] 1 2 3 4 5 6 7 8 9 10 ...
## $ Species : chr [1:344] "Adelie Penguin (Pygoscelis adeliae)" "Adelie Penguin (Pygoscelis adeliae)" "Adelie Penguin (Pygoscelis adeliae)" "Adelie Penguin (Pygoscelis adeliae)" ...
## $ Region : chr [1:344] "Anvers" "Anvers" "Anvers" "Anvers" ...
## $ Island : chr [1:344] "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
## $ Stage : chr [1:344] "Adult, 1 Egg Stage" "Adult, 1 Egg Stage" "Adult, 1 Egg Stage" "Adult, 1 Egg Stage" ...
## $ Individual ID : chr [1:344] "N1A1" "N1A2" "N2A1" "N2A2" ...
## $ Clutch Completion : chr [1:344] "Yes" "Yes" "Yes" "Yes" ...
## $ Date Egg : Date[1:344], format: "2007-11-11" ...
## $ Culmen Length (mm) : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ Culmen Depth (mm) : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ Flipper Length (mm): num [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
## $ Body Mass (g) : num [1:344] 3750 3800 3250 NA 3450 ...
## $ Sex : chr [1:344] "MALE" "FEMALE" "FEMALE" NA ...
## $ Delta 15 N (o/oo) : num [1:344] NA 8.95 8.37 NA 8.77 ...
## $ Delta 13 C (o/oo) : num [1:344] NA -24.7 -25.3 NA -25.3 ...
## $ Comments : chr [1:344] "Not enough blood for isotopes." NA NA "Adult not sampled." ...
An alternative to str is the glimpse
function from the dplyr package.
dplyr::glimpse(penguins_raw)
## Rows: 344
## Columns: 17
## $ studyName <chr> "PAL0708", "PAL0708", "PAL07…
## $ `Sample Number` <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 1…
## $ Species <chr> "Adelie Penguin (Pygoscelis …
## $ Region <chr> "Anvers", "Anvers", "Anvers"…
## $ Island <chr> "Torgersen", "Torgersen", "T…
## $ Stage <chr> "Adult, 1 Egg Stage", "Adult…
## $ `Individual ID` <chr> "N1A1", "N1A2", "N2A1", "N2A…
## $ `Clutch Completion` <chr> "Yes", "Yes", "Yes", "Yes", …
## $ `Date Egg` <date> 2007-11-11, 2007-11-11, 200…
## $ `Culmen Length (mm)` <dbl> 39.1, 39.5, 40.3, NA, 36.7, …
## $ `Culmen Depth (mm)` <dbl> 18.7, 17.4, 18.0, NA, 19.3, …
## $ `Flipper Length (mm)` <dbl> 181, 186, 195, NA, 193, 190,…
## $ `Body Mass (g)` <dbl> 3750, 3800, 3250, NA, 3450, …
## $ Sex <chr> "MALE", "FEMALE", "FEMALE", …
## $ `Delta 15 N (o/oo)` <dbl> NA, 8.94956, 8.36821, NA, 8.…
## $ `Delta 13 C (o/oo)` <dbl> NA, -24.69454, -25.33302, NA…
## $ Comments <chr> "Not enough blood for isotop…
The penguins_raw data has terrible variable names.
print(penguins_raw$`Flipper Length (mm)`, max = 10)
## [1] 181 186 195 NA 193 190 181 195 193 190
## [ reached getOption("max.print") -- omitted 334 entries ]
We will select only the variables that we will use in the future.
# select certain columns of penguins_raw, assign new name
penguins_clean <-
penguins_raw |>
subset(select = c("Species", "Island", "Culmen Length (mm)", "Culmen Depth (mm)", "Flipper Length (mm)", "Body Mass (g)", "Sex"))
To rename the columns of penguins_clean, we use the
names function.
# access column names and replace with new names
names(penguins_clean) <- c("species",
"island",
"bill_length",
"bill_depth",
"flipper_length",
"body_mass",
"sex")
# look at new column names
names(penguins_clean)
## [1] "species" "island" "bill_length"
## [4] "bill_depth" "flipper_length" "body_mass"
## [7] "sex"
Notable remaining issues with penguins_clean:
species, island, and sex
variables are categorical, but are represented as character
vectors.factor.transform function to convert the each
variable to a factor.# convert sex variable to factor, replace original object
penguins_clean <-
penguins_clean |>
transform(species = factor(species),
island = factor(island),
sex = factor(sex))
# view structure
dplyr::glimpse(penguins_clean)
## Rows: 344
## Columns: 7
## $ species <fct> Adelie Penguin (Pygoscelis adeliae)…
## $ island <fct> Torgersen, Torgersen, Torgersen, To…
## $ bill_length <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 3…
## $ bill_depth <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 1…
## $ flipper_length <dbl> 181, 186, 195, NA, 193, 190, 181, 1…
## $ body_mass <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3…
## $ sex <fct> MALE, FEMALE, FEMALE, NA, FEMALE, M…
The levels of of species, island, and
sex are not formatted well.
# determine levels of species and sex
levels(penguins_clean$species)
## [1] "Adelie Penguin (Pygoscelis adeliae)"
## [2] "Chinstrap penguin (Pygoscelis antarctica)"
## [3] "Gentoo penguin (Pygoscelis papua)"
levels(penguins_clean$sex)
## [1] "FEMALE" "MALE"
We now change the levels of each variable in the same order they are printed above and confirm that the changes were successful.
# update factor levels of species and sex
levels(penguins_clean$species) <- c("adelie", "chinstrap", "gentoo")
levels(penguins_clean$sex) <- c("female", "male")
# confirm that changes took effect
dplyr::glimpse(penguins_clean)
## Rows: 344
## Columns: 7
## $ species <fct> adelie, adelie, adelie, adelie, ade…
## $ island <fct> Torgersen, Torgersen, Torgersen, To…
## $ bill_length <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 3…
## $ bill_depth <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 1…
## $ flipper_length <dbl> 181, 186, 195, NA, 193, 190, 181, 1…
## $ body_mass <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3…
## $ sex <fct> male, female, female, NA, female, m…
Numerical exploration of a data set generally consists of computing various relevant statistics for each of the variables in a data set in order to summarize the data.
| numeric summary | variable type | summarizes | R function |
|---|---|---|---|
| mean | numeric |
center | mean |
| median | numeric |
center | median |
| variance | numeric |
spread | var |
| standard deviation | numeric |
spread | sd |
| interquartile range | numeric |
spread | quantile (modified) |
| quantiles | numeric |
center and spread | quantile |
| correlation | numeric |
similarity | cor |
| frequency distribution | factor |
counts | table |
| relative frequency distribution | factor |
proportions | table (modified) |
Numerical exploration of a set of numeric values usually
focuses on determining the:
It can also be useful to compute the correlation between two
numeric variables.
The sample mean and median are the most common statistics used to represent the “center” of a set of numeric values.
The sample mean or average is:
mean.The sample median is:
median.We compute the mean of the body_mass variable of the
penguins_clean data in the code below.
mean(penguins_clean$body_mass)
## [1] NA
Why is the result NA instead of a number?
# compute sample mean and median body_mass, ignoring NAs
mean(penguins_clean$body_mass, na.rm = TRUE)
## [1] 4201.754
median(penguins_clean$body_mass, na.rm = TRUE)
## [1] 4050
Question: The median is less than the mean (i.e., large values are pulling the mean in the positive direction), what might this tell us about the distribution?
The pth quantile (where \(0\leq p \leq 1\)) of a set of values is the value that separates the smallest \(100 p\)% of the values from the upper \(100(1-p)\)% of the values.
The quantile function is used to compute sample
quantiles.
Quantiles are useful quantifying:
quantile(penguins_clean$body_mass,
probs = c(0, 0.25, 0.5, 0.75, 1),
na.rm = TRUE)
## 0% 25% 50% 75% 100%
## 2700 3550 4050 4750 6300
Question: Q3 and the maximum are further from the median than Q1 and the minimum. Is this evidence that this variable may be positively skewed?
Spread is related to how far values are from each other.
The sample variance of a set of values is:
var function.The sample standard deviation is:
sd function.The larger the standard deviation or variance of a set of values, the more they vary from their sample mean.
The sample standard deviation and variance can be greatly affected by outliers.
The interquartile range is the difference between the 0.75 and 0.25 quantiles of a data set.
The minimum and maximum (in relation to the sample mean or median) can also be used to ascertain the spread of a data set.
min and max
functions/.We compute these measures of spread for the body_mass
variable below.
# sample variance
var(penguins_clean$body_mass, na.rm = TRUE)
## [1] 643131.1
# sample standard deviation
sd(penguins_clean$body_mass, na.rm = TRUE)
## [1] 801.9545
# interquartile range (names = FALSE removes text above the results)
quantile(penguins_clean$body_mass, probs = 0.75,
na.rm = TRUE, names = FALSE) -
quantile(penguins_clean$body_mass, probs = 0.25,
na.rm = TRUE, names = FALSE)
## [1] 1200
# minimum
min(penguins_clean$body_mass, na.rm = TRUE)
## [1] 2700
# maximum
max(penguins_clean$body_mass, na.rm = TRUE)
## [1] 6300
The correlation between two numeric variables quantifies
the strength and direction of their linear relationship.
The most common correlation statistic is Pearson’s correlation
statistic. If \(x_1, x_2, x_n\) and
\(y_1, y_2, \ldots, y_n\) are two sets
of numeric values, then the sample correlation statistic is
computed as \[r =
\frac{1}{n-1}\sum_{i=1}^n\left(\frac{x_i -
\bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right),\]
where:
cor function can be used to compute the sample
correlation between two numeric variables.Interpretation
In the code below, we compute the sample correlation between all
numeric variables in penguins_clean. We set
use = "pairwise.complete.obs" so that all
non-NA pairs of values are used in the calculation.
# determine whether each variable is numeric
num_col <- unlist(lapply(penguins_clean, is.numeric))
# observe results
num_col
## species island bill_length bill_depth
## FALSE FALSE TRUE TRUE
## flipper_length body_mass sex
## TRUE TRUE FALSE
# compute correlation of numeric variables
cor(penguins_clean[, num_col],
use = "pairwise.complete.obs")
## bill_length bill_depth flipper_length
## bill_length 1.0000000 -0.2350529 0.6561813
## bill_depth -0.2350529 1.0000000 -0.5838512
## flipper_length 0.6561813 -0.5838512 1.0000000
## body_mass 0.5951098 -0.4719156 0.8712018
## body_mass
## bill_length 0.5951098
## bill_depth -0.4719156
## flipper_length 0.8712018
## body_mass 1.0000000
bill_length and
body_mass is 0.87, so the larger a penguin is, the larger
its bill tends to be.bill_length and bill_depth is -0.24, so the
longer a bill becomes, the shallower (narrower) we expect the depth to
be.bill_depth and
body_mass is -0.47, so larger penguins tend to have
narrower bills.A frequency distribution or relative frequency distribution are useful numeric summaries of categorical data.
The table function returns a contingency table
summarizing the number of observations having each level. Note that by
default, the table ignores NA values.
table(penguins_clean$sex)
##
## female male
## 165 168
To count the NA values (if present), we can set the
useNA argument of table to
"ifany".
table(penguins_clean$sex, useNA = "ifany")
##
## female male <NA>
## 165 168 11
A relative frequency distribution summarizes the proportion or percentage of observation with each level of a categorical variable. To compute the relative frequency distribution of a variable, we must divide the frequency distribution by the number of observations.
# divide the frequence distribution of sex by the number of non-NA values
table(penguins_clean$sex)/sum(!is.na(penguins_clean$sex))
##
## female male
## 0.4954955 0.5045045
If we want to include the NA values in our table, we can
use the code below.
table(penguins_clean$sex, useNA = "ifany")/length(penguins_clean$sex)
##
## female male <NA>
## 0.47965116 0.48837209 0.03197674
We do not know the sex of approximately 3% of the
penguins observations.
summary functionThe summary function provides a simple approach for
quickly quantifying the center and spread of each numeric
variable in a data frame or determining the frequency distribution of a
factor variable.
summary(penguins_clean)
## species island bill_length
## adelie :152 Biscoe :168 Min. :32.10
## chinstrap: 68 Dream :124 1st Qu.:39.23
## gentoo :124 Torgersen: 52 Median :44.45
## Mean :43.92
## 3rd Qu.:48.50
## Max. :59.60
## NA's :2
## bill_depth flipper_length body_mass
## Min. :13.10 Min. :172.0 Min. :2700
## 1st Qu.:15.60 1st Qu.:190.0 1st Qu.:3550
## Median :17.30 Median :197.0 Median :4050
## Mean :17.15 Mean :200.9 Mean :4202
## 3rd Qu.:18.70 3rd Qu.:213.0 3rd Qu.:4750
## Max. :21.50 Max. :231.0 Max. :6300
## NA's :2 NA's :2 NA's :2
## sex
## female:165
## male :168
## NA's : 11
##
##
##
##
Visual summaries (i.e., plots) of data help us:
The two most popular packages for producing graphics in R are:
| Visual summary | Variable types | Summary type | Base functions | geoms |
|---|---|---|---|---|
| box plot | numeric |
univariate | boxplot |
geom_boxplot |
| histogram | numeric |
univariate | hist |
geom_histogram |
| density plot | numeric |
univariate | plot, density |
geom_density |
| bar plot | factor |
univariate | plot or barplot,
table |
geom_bar |
| scatter plot | 2 numeric |
bivariate | plot |
geom_point |
| parallel box plot | 1 numeric, 1 factor |
bivariate | plot or boxplot |
geom_boxplot |
| grouped scatter plot | 2 numeric, 1 factor |
multivariate | plot |
geom_point |
| facetted plots | mixed | multivariate | none | facet_wrap or facet_grid |
| interactive plots | mixed | multivariate | none | plotly::ggplotly |
There are 4 main components needed to produce a graphic using ggplot2.
ggplot object.
ggplot function.We add “layers” of information to a ggplot, such as
geoms, scales, or other customizations, using +.
A univariate plot is a plot that only involves a single variable.
A bar plot (or bar chart) displays the number or proportion of
observations in each category of a categorical variable (or using R
terminology, each level of a factor
variable).
The simplest way to create a bar plot in base R is using the
plot function on a factor.
plot(penguins$island, main = "distribution of island")
Alternatively, we can combine barplot with the
table function.
barplot(table(penguins_clean$sex, useNA = "ifany"),
names.arg = c("female", "male", "NA"))
To create a relatively frequency bar plot, we should divide the
results of table by the number of relevant
observations.
barplot(table(penguins_clean$sex, useNA = "ifany") /
length(penguins_clean$sex),
names.arg = c("female", "male", "NA"))
To create a bar plot with ggplot2, we first create a
basic ggplot object containing our data. Make sure to load
the ggplot2 package prior to creating the plot,
otherwise you’ll get errors!
# load ggplot2 package
library(ggplot2)
# create generic ggplot object with our data
gg_penguin <- ggplot(data = penguins_clean)
gg_penguin is a minimal ggplot object with
the raw information needed to produce future graphics.
To create a bar plot, we add the geom geom_bar and map
the species variable (in this example) to the
x aesthetic using the aes function.
# create bar plot for species variable
gg_penguin + geom_bar(aes(x = species))
A box plot indicates:
Outliers are usually marked with starts or dots.
Box plots:
The boxplot function is the easiest approach for
producing a box plot using base R.
boxplot(penguins_clean$body_mass,
data = penguins_clean,
main = "distribution of body mass")
Questions:
To create a box plot using ggplot2, we use
geom_boxplot.
gg_penguin + geom_boxplot(aes(y = bill_length))
## Warning: Removed 2 rows containing non-finite values
## (`stat_boxplot()`).
Questions:
A histogram: - displays the distribution of a numeric
variable. - counts the number of values falling into (usually)
equal-sized “bins”. - are used to assess skewness, modality (the number
of clear “peaks” in the plot), and to some extent, outliers.
The hist function is used create a histogram of a
numeric variable.
hist(penguins_clean$bill_length,
main = "",
xlab = "bill length (mm)",
breaks = 20)
Questions:
We use geom_histogram to create a histogram using
ggplot2.
gg_penguin + geom_histogram(aes(x = flipper_length))
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
## Warning: Removed 2 rows containing non-finite values
## (`stat_bin()`).
Question: Is the variable unimodal or bimodal?
A density plot is similar to a smoothed histogram.
The plot and density function can be
combined to construct a density plot using base R.
plot(density(penguins_clean$bill_depth,
na.rm = TRUE),
main = "")
Question: Is the variable unimodal or bimodal?
We create a density plot with ggplot2 using
geom_density.
gg_penguin + geom_density(aes(x = body_mass))
## Warning: Removed 2 rows containing non-finite values
## (`stat_density()`).
Questions:
body_mass variable is unimodal?A bivariate plot is a plot involving two variables.
Scatter plots can be used to identify the relationship between two
numeric variables.
We use the plot function to create a scatter plot.
# xlab and ylab are used to customize the x-axis and y-axis labels
plot(bill_length ~ body_mass,
data = penguins_clean,
xlab = "body mass (g)",
ylab = "bill length (mm)")
Questions:
body_mass and
bill length?The geom_point function can be used to create a scatter
plot with ggplot2.
gg_penguin +
geom_point(aes(x = bill_depth, y = bill_length))
## Warning: Removed 2 rows containing missing values
## (`geom_point()`).
Question: What can we conclude from this plot?
A parallel box plot is used to display the distribution of a
numeric variable whose values are grouped based on each
level of a factor variable. Parallel box plot
are useful for determining if the distribution of a numeric
variable substantially changes based on whether an observation has a
certain level of a factor.
plot(body_mass ~ sex, data = penguins_clean)
We can produce something similar with ggplot2 by
specifying both the y and x aesthetics of for
geom_boxplot.
gg_penguin + geom_boxplot(aes(x = species, y = bill_length))
## Warning: Removed 2 rows containing non-finite values
## (`stat_boxplot()`).
A multivariate plot displays relationships between 2 or more variables.
Multivariate plots are more easily created using ggplot2 than base.
A grouped scatter plot is a scatter plot that uses colors or symbols
(or both) to indicate the level of a factor
variable that each point corresponds to.
gg_penguin +
geom_point(aes(x = body_mass,
y = flipper_length,
color = species))
## Warning: Removed 2 rows containing missing values
## (`geom_point()`).
We use a colorblind-friendly palette and some additional customizations below.
gg_penguin +
geom_point(aes(x = body_mass,
y = flipper_length,
color = species,
shape = species)) +
scale_color_brewer(type = "qual",
palette = "Dark2") +
xlab("body mass (g)") +
ylab("flipper length (mm)") +
ggtitle("body mass versus flipper length by species")
## Warning: Removed 2 rows containing missing values
## (`geom_point()`).
Facetting creates separate panels (facets) of plots based on one or more facetting variables.
The key functions to do this with ggplot2 are:
facet_grid is used to create a grid of plots based on
one or two factor variables.facet_wrap wraps facets of panels around the plot based
on one factor variable..# simple facetting example
gg_penguin +
geom_point(aes(x = bill_depth, y = bill_length)) +
facet_grid(~ species)
## Warning: Removed 2 rows containing missing values
## (`geom_point()`).
How do we deal with NAs in facetting?
# facetting with NA facet!
gg_penguin +
geom_density(aes(x = body_mass)) +
facet_grid(~ sex)
## Warning: Removed 2 rows containing non-finite values
## (`stat_density()`).
# to remove NA facet, you must remove NAs
# only do this for relevant columns to retain
# as much data as possible
penguins_temp <-
penguins_clean |>
subset(select = c(body_mass, sex, species)) |>
na.omit()
# new plot from "clean" data
ggplot(penguins_temp) +
geom_density(aes(x = body_mass)) +
facet_grid(~ sex)
Here’s another facetting example that uses transparency.
ggplot(data = penguins_temp) +
geom_density(aes(x = body_mass, fill = sex),
alpha = 0.5) +
facet_grid(~ species)
The plotly package is an R package to
provide the capabilities of plotly [https://plotly.com/].
ggplotly function will instantly make a
ggplot interactive.# load plotly package
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# create grouped scatter plot
ggi <-
gg_penguin +
geom_point(aes(x = body_mass, y = flipper_length,
color = species, shape = species)) +
scale_color_brewer(type = "qual", palette = "Dark2") +
xlab("body mass (g)") + ylab("flipper length (mm)") +
ggtitle("body mass versus flipper length by species")
# make plot interactive
ggplotly(ggi)
# create parallel box plots
ggi2 <-
gg_penguin +
geom_boxplot(aes(x = species, y = bill_length))
# make plot interactive
ggplotly(ggi2)
## Warning: Removed 2 rows containing non-finite values
## (`stat_boxplot()`).
str or dplyr::glimpse function to
get an idea of the initial structure. Look for problems with variable
names and types, etc.factor).summary function on your data frame. Take note
of NAs, impossible values that are data entry errors, etc.
Perhaps perform some additional cleaning based on this information.numeric variablesnumeric
variables.factor variables.numeric variables.numeric and
factor variables.numeric variable facetted by
the factor variable.numeric variables filled
with different colors by the factor variable.Correct erroneous data entries, when possible. If that’s not
possible, replace them with NA values.
What should you do about NAs?